Because it is a statistical language, there are a large number of probability distributions in R by default and an even larger number that can be loaded from packages. The table below gives a listing of the most common distributions in R, the name of the function within R, and the parameters of the distribution
| Distribution | R name | Parameters |
|---|---|---|
| beta | beta | shape1, shape2, ncp |
| Binomial | binom | size, prob |
| Cauchy | cauchy | location, scale |
| chi-squared | chisq | df, ncp |
| exponential | exp | rate |
| F | f | df1, df2, ncp |
| gamma | gamma | shape, scale |
| geometric | geom | prob |
| hypergeometric | hyper | m, n, k |
| log-normal | lnorm | meanlog, sdlog |
| logistic | logis | location, scale |
| Negative binomial | nbinom | size, prob |
| normal | norm | mean, sd |
| Poisson | pois | lambda |
| Student’s t | t | df, ncp |
| uniform | unif | min, max |
| Weibull | weibull | shape, scale |
| Wilcoxon | wilcox | m, n |
R actually provides four related functions for each probability distribution. These functions are called by adding a letter at the beginning of the function name. The variants of each probability distribution are:
The first argument to these functions is the same regardless of the distribution and is the value x for “d”, the quantile q for “p”, the probability p for “q”, and the sample size n for “r”
All of this will make more sense once we consider a concrete example. Let’s take a look at the normal probability density function first, since it’s the one you’re most familiar with. If you use ?dnorm you’ll see that for many of the function arguments there are default values, specifically mean=0 and sd=1. Therefore if these values are not specified explicitly in the function call R assumes you want a standard Normal distribution.
x = seq(-5,5,by=0.1)
plot(x,dnorm(x),type='l') ## that’s a lowercase “L” for “line”
abline(v=0) ## add a line to indicate the mean (“v” is for “vertical”)
lines(x,dnorm(x,2),col=2) ## try changing the mean (“col” sets the color)
abline(v=2,col=2)
lines(x,dnorm(x,-1,2),col=3) ## try changing the mean and standard dev
abline(v=-1,col=3)
You will use density functions quite frequently to estimate the likelihood of data. For example if we collected a data point of x = 0.5 we can calculate the likelihood that this data point came from each of these three normal distributions. Implicit in the likelihood is that we’re looking at the probability of the data in a very small window, \(\delta x\), and that \(\delta x\) never changes so that:
\[Pr(x < X < x + \delta x) = \int_{x}^{x + \delta x} f(x)dx \propto f(x)\]
dnorm(0.5,0,1)
## [1] 0.3521
dnorm(0.5,2,1)
## [1] 0.1295
dnorm(0.5,-1,2)
## [1] 0.1506
This shows that the first distribution has a higher likelihood than the other two, which are about the same. We interpret this as saying that the observed data point was more likely to have been generated by the first distribution than the other two. This is consistent with where the dashed blue line intersects each of the curves.
This plot of the normal distribution and the effects of varying the parameters in the normal are both probably familiar to you already – changing the mean changes where the distribution is centered while changing the standard deviation changes the spread of the distribution. Next try looking at the CDF of the normal:
plot(x,pnorm(x,0,1),type='l')
abline(v=0)
lines(x,pnorm(x,2,1),col=2)
abline(v=2,col=2)
lines(x,pnorm(x,-1,2),col=3)
abline(v=-1,col=3)
Using the CDF we can calculate \(Pr(X \leq x)\) for our data point
pnorm(0.5,0,1)
## [1] 0.6915
pnorm(0.5,2,1)
## [1] 0.06681
pnorm(0.5,-1,2)
## [1] 0.7734
If the value 0.5 here corresponded to some hypothesis, the CDF could be use to calculate a p-value associated with the one-sided test. Would any be significant at \(\alpha\)=0.05 significance? At \(\alpha\)=0.1?
Next let’s look at the function qnorm. Since the input to this function is a quantile, the x-values for the plot are restricted to the range [0,1].
p = seq(0,1,by=0.01)
plot(p,qnorm(p,0,1),type='l',ylim=range(x)) # ylim sets the y-axis range
# range returns the min/max as a 2-element vector
abline(h=0) # “h” for “horizontal”
lines(p,qnorm(p,2,1),col=2)
abline(h=2,col=2)
lines(p,qnorm(p,-1,2),col=3)
abline(h=-1,col=3)
It should be readily apparent that the quantile function is the inverse of the CDF. This function can be used to find the median of the distribution (p = 0.5) or to estimate confidence intervals at any level desired.
qnorm(c(0.025,0.975),0,1) # what width CI is specified by these values?
## [1] -1.96 1.96
plot(p,qnorm(p,0,1),type='l',ylim=range(x))
abline(v=c(0.025,0.975),lty=2) # add vertical lines at the CI
abline(h=qnorm(c(0.025,0.975)),lty=2) #add horizontal lines at the threshold vals
plot(x,dnorm(x,0,1),type='l') # plot the corresponding pdf
abline(v=qnorm(c(0.025,0.975)),lty=2)
Finally, let’s investigate the rnorm function for generating random numbers that have a normal distribution. Here we generate histograms that have a progressively larger sample size and compare that to the actual density of the standard normal.
n = c(10,100,1000,10000) # sequence of sample sizes
for(i in 1:4){ # loop over these sample sizes
hist(rnorm(n[i]),main=n[i],probability=TRUE,breaks=40)
#here breaks defines number of bins in the histogram
lines(x,dnorm(x),col=2)
}
To make these plots we introduced the use of a for loop. A for loop iterates over all specified values of some index. Above, we used i as our index, and specified values 1, 2, 3, and 4 with the 1:4 syntax from last week. for loops always start this way (including parentheses): “for(index in vector)”. The code in the { } then executes once for every value in the vector, with i taking on the next value in the list. The for syntax in R is very flexible and will allow you to loop over a vector of anything, not just integers…characters, factors, filenames, etc. Also note that your index variable can be anything, but to avoid unwanted results make sure it’s not the same as a variable you’re using elsewhere in the code (e.g. indexing by x above would ruin everything, because x would be overwritten on every iteration).
One other technical note: like any function in R that generates random output, this example will give different results every time you run it.
This example demonstrates that as the number of random draws from a probability distribution increases, the histogram of those draws provides a better and better approximation of the density itself. We will make use of this fact extensively this semester because – as odd as this may sound now – there are many distributions that are easier to randomly sample from than solve for analytically. We can also show that as the sample size increases, the sample mean and standard deviation get closer to the true population values (mean = 0, sd = 1 in this example) and the standard error gets smaller:
y = rnorm(10) ## Sample size = 10
y
## [1] 1.0999 0.5795 -0.7851 0.9304 -0.4635 -0.4619 -0.5039 0.2385
## [9] -0.3116 0.2796
mean(y)
## [1] 0.06019
sd(y) ## standard deviation
## [1] 0.6588
sd(y)/sqrt(10) ## standard error
## [1] 0.2083
y = rnorm(10000) ## Sample size = 10000
mean(y)
## [1] 0.01818
sd(y) ## standard deviation
## [1] 0.9966
sd(y)/sqrt(10000) ## standard error
## [1] 0.009966
We are next going to investigate several other common probability distributions. There will be a few examples of the effects of varying parameters for each distribution, but you are strongly recommended to explore and try other values to see how they affect the shape of each PDF. You might also want to refer to Appendix F in the textbook or Chapter 3 of Hilborne and Mangel in order to learn more about each distribution. There is also a good chart at http://www.johndcook.com/distribution_chart.html that describes the relationships among these common distributions, and the Wikipedia articles for most of them are good for quick reference.
There’s a good bit of code this week so you might want to cut and paste some of it into your script rather than retyping it. We recommend running each clump of code by itself, and especially that you make sure you understand what it did before proceeding to the next box.
*For each of the distributions below, please include the multi-panel figure generated in your lab report and provide a short answer to the questions that follow. If you write any additional code in deriving your answers (you may not need to in this lab), include it and any R outputs or figures it generates. When doing calculations “by hand,” include the formulas you used. You do not have to show all your work, but it may make it easier to assign partial credit.
For this lab, turn everything in as a single .Rmd file by the start of next lab.*
Note, if you didn’t start with the tutorial above, you’ll need to define x and p variables before running the codes below.
The uniform is used when there’s an equal probability of an event occurring over a range of values.
plot(x,dunif(x),type='l')
lines(x,dunif(x,0,4),col=2)
lines(x,dunif(x,-3,-0.5),col=3)
plot(x,punif(x),type='l')
lines(x,punif(x,0,4),col=2)
lines(x,punif(x,-3,-0.5),col=3)
plot(p,qunif(p),type='l',ylim=range(x))
lines(p,qunif(p,0,4),col=2)
lines(p,qunif(p,-3,-0.5),col=3)
hist(runif(500,-3,3),breaks=30,xlim=range(x),probability=TRUE)
lines(x,dunif(x,-3,3),col=2)
1. Why does the height of the uniform PDF change as the width changes?
2. What do the second and third arguments to the uniform specify? What are their default values?
The Beta is an interesting distribution because it is bound on the range 0 <= X <= 1 and thus is very often used to describe data that is a proportion. At first glance the mathematical formula for the Beta looks a lot like the Binomial:
\[Beta(x \mid a,b) \propto x^{a-1} (1-x)^{b-1}\] \[Binom(x \mid n,p) \propto p^x (1-p)^{n-p}\]
The critical difference between the two is that in the Beta the random variable X is in the base while in the Binomial it is in the exponent. Unlike many distributions you may be have used in the past, the two shape parameters in the Beta do not define the mean and variance, but these can be calculated as simple functions of \(\alpha\) and \(\beta\). The Beta does have an interesting property of symmetry though, whereby Beta(\(\alpha\), \(\beta\)) is the reflection of Beta(\(\beta\),\(\alpha\)).
plot(p,dbeta(p,5,5),type='l')
lines(p,dbeta(p,1,1),col=2)
lines(p,dbeta(p,0.2,0.2),col=3)
plot(p,dbeta(p,6,6),type='l',ylim=c(0,5))
lines(p,dbeta(p,6,4),col=2)
lines(p,dbeta(p,6,2),col=3)
lines(p,dbeta(p,6,1.25),col=4)
lines(p,dbeta(p,6,1),col=5)
lines(p,dbeta(p,6,0.25),col=6)
plot(p,pbeta(p,5,5),type='l')
lines(p,pbeta(p,1,1),col=2)
lines(p,pbeta(p,0.2,0.2),col=3)
hist(rbeta(500,3,3),breaks=30,xlim=range(p),probability=TRUE)
lines(p,dbeta(p,3,3),col=2)
lines(p,dbeta(p,3,3),col=2)
Lab questions:
3) The Beta has a special case, Beta(1,1) that is equivalent to what other PDF?
4) In the first panel, the mean is the same for each line (0.5). What are the variances? (Hint: you need to calculate this analytically. Look up the distribution in one of the recommended references.)
5) In the second panel, what are the means and medians of each of these curves? (Hint: you'll need to calculate the mean analytically and use one of the variants of R's beta function to find the median.)
The lognormal is a log transform of the normal distribution. It is defined on the range X > 0 so is commonly used for data that cannot be negative by definition. The distribution is also positively skewed so is often used for skewed data. One thing that often goes unappreciated with the log-normal is that the mean, E[X], depends on the variance:
\[E[X] = e^{\mu + {{1}\over{2}}\sigma^2}\]
This applies not just when you explicitly use the lognormal, but also whenever you log-transform data and then calculate a mean or standard deviation – a fact that is vastly under-appreciated in the biological and environmental sciences and frequently missed in the published literature. In fact, ANY data transformation applied to make data “more normal” will change the mean, with the functional form of the bias depending on the transformation used. You can not simply back-transform the data without correcting for this bias. This phenomena is another illustration of Jensen’s Inequality.
## changing the mean
x <- 10^seq(-2,2,by=0.01)
plot(x,dlnorm(x,0),type='l',xlim=c(0,15),main="Changing the Mean")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
## on a log scale
plot(x,dlnorm(x,0),type='l',log='x',main="Log Scale")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
abline(v=exp(0),col=1)
abline(v=exp(1),col=2)
abline(v=exp(2),col=3)
## changing the variance
plot(x,dlnorm(x,2,.125),type='l',xlim=c(0,20),ylim=c(0,0.6),main="Changing the Variance")
lines(x,dlnorm(x,2,0.25),col=2)
lines(x,dlnorm(x,2,0.5),col=3)
lines(x,dlnorm(x,2,1),col=4)
lines(x,dlnorm(x,2,2),col=5)
lines(x,dlnorm(x,2,4),col=6)
abline(v=exp(2),col=1)
## random sample
hist(rlnorm(250,2,1),breaks=30,probability=TRUE)
lines(x,dlnorm(x,2,1),col=4)
6) What are the arithmetric and geometric means of the three curves in the first panel? (Reminder: arithmetic means are means in the linear domain, geometric means are means in the log domain)
The exponential distribution arises naturally as the time it takes for an event to occur when the average rate of occurrence, r, is constant. The exponential is a special case of the Gamma (discussed next) where \(Exp(X \mid r) = Gamma(X \mid 1,r)\). The exponential is also a special case of the Weibull, \(Exp(X \min r) = Weibull(X \mid r,1)\), where the Weibull is a generalization of the exponential that allows the rate parameter r to increase or decrease with time. The Laplace is basically a two-sided exponential and arises naturally if one is dealing with absolute deviation, \(|x-m|\), rather than squared deviation, \((x-m)^2\), as is done with the normal.
## changing the mean
x <- seq(0,10,by=0.01)
plot(x,dexp(x,0.125),type='l',ylim=c(0,1))
lines(x,dexp(x,0.25),col=2)
lines(x,dexp(x,0.5),col=3)
lines(x,dexp(x,1),col=4)
lines(x,dexp(x,2),col=5)
lines(x,dexp(x,4),col=6)
## random sample
hist(rexp(250,2),breaks=30,probability=TRUE)
lines(x,dexp(x,2),col=4)
## laplace vs Gaussian
plot(x,dexp(abs(x-5),1)/2,type='l')
lines(x,dnorm(x,5),col=2)
plot(x,dexp(abs(x-5),1)/2,type='l',log='y') ## same plot as last but on a log scale
lines(x,dnorm(x,5),col=2)
7) The last two panels compare a normal and a Laplace distribution with the same mean and variance. How do the two compare? In particular, compare the difference in the probabilities of extreme events in each distribution.
The gamma and inverse-gamma distribution are flexible distributions defined for positive real numbers. These are frequently used to model the distribution of variances or precisions (precision = 1/variance), in which case the shape and rate parameters are related to the sample size and sum of squares, respectively. The gamma is frequently used in Bayesian statistics as a prior distribution, and also in mixture distributions for inflating the variance of another distribution
x <- seq(0,10,by=0.01)
## change rate
plot(x,dgamma(x,3,3),type='l' ,ylim=c(0,1.6))
lines(x,dgamma(x,3,1),col=2)
lines(x,dgamma(x,3,6),col=3)
lines(x,dgamma(x,3,1/3),col=4)
## change shape
plot(x,dgamma(x,3,3),type='l')
lines(x,dgamma(x,1,3),col=2)
lines(x,dgamma(x,6,3),col=3)
lines(x,dgamma(x,18,3),col=4)
## change variance
a <- c(20,15,10,5,2.5,1.25)
r <- c(4,3,2,1,0.5,0.25)
plot(x,dgamma(x,20,4),type='l')
for(i in 1:6){
lines(x,dgamma(x,a[i],r[i]),col=i)}
var = a/r^2
mean = a/r
legend("topright",legend=format(var,digits=3),lty=1,col=1:6)
var
## [1] 1.250 1.667 2.500 5.000 10.000 20.000
mean
## [1] 5 5 5 5 5 5
## change mean
var = 4
mean = c(1,2,3,4,5)
rate = mean/var
shape = mean^2/var
plot(x,dgamma(x,shape[2],rate[2]),type='l',col=2)
for(i in 1:5){
lines(x,dgamma(x,shape[i],rate[i]),col=i)
}
legend("topright",legend=1:5,lty=1,col=1:5)
rate
## [1] 0.25 0.50 0.75 1.00 1.25
shape
## [1] 0.25 1.00 2.25 4.00 6.25
The binomial arises naturally from counts of the number of successes given a probability of success, p, and a sample size, n. You are probably already familiar with the binomial in the context of coin toss examples.
x <- 0:11
## vary size of sample (number of draws)
size = c(1,5,10)
plot(x,dbinom(x,size[1],0.5),type='s')
for(i in 2:3){
lines(x,dbinom(x,size[i],0.5),type='s',col=i)
}
## vary probability
n = 10
p = c(0.1,0.25,0.5)
plot(x,dbinom(x,n,p[1]),type='s')
for(i in 2:3){
lines(x,dbinom(x,n,p[i]),col=i,type='s')
}
abline(v = n*p,col=1:3,lty=2)
## CDF
plot(x,pbinom(x,n,p[1]),type='s',ylim=c(0,1))
for(i in 2:3){
lines(x,pbinom(x,n,p[i]),col=i,type='s')
}
## Random samples
hist(rbinom(100,10,0.5)+0.0001, ## small amount added because
probability=TRUE, ## of way R calculates breaks
breaks=0:11,main="Random Binomial")
lines(x,dbinom(x,10,0.5),type='s',col=2) #Analytical solution
legend("topright",legend=c("sample","pmf"),lty=1,col=1:2)
10) Consider a binomial distribution that has a constant mean, np. What are the differences in the shape of this distribution if it has a high n and low p vs. a low n and high p?
The Poisson is also very common for count data and arises as the number of events that occur in a fixed amount of time (e.g. number of bird sightings per hour), or the number of items found in a fixed amount of space (e.g. the number of trees in a plot). Unlike the Binomial distribution the Poisson doesn’t have a fixed upper bound for the number of events that can occur.
x <- 0:12
plot(x,dpois(x,1),type='s')
lines(x,dpois(x,2),type='s',col=2)
lines(x,dpois(x,5),type='s',col=3)
x <- 20:80
plot(x,dpois(x,50),type='s',ylim=c(0,0.08)) #Poisson with mean 50 (variance = 50)
lines(x+0.5,dnorm(x,50,sqrt(50)),col=2) #Normal with mean and variance of 50
lines(x,dbinom(x,100,0.5),col=3,type='s') #Binomial with mean 50 (variance = 25)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
plot(x,dbinom(x,100,0.5),type='s',col=3) #Binomial with mean 50 (variance = 25)
lines(x+0.5,dnorm(x,50,sqrt(25)),col=2) #Normal with mean 50 and variance of 25
lines(x,dpois(x,50),col=1,type='s') #Poisson with mean 50 (variance = 50)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
The last two panels depict a comparison of the Poisson, Normal, and Binomial with the same mean and a large sample size. The Poisson and Binomial are identical in the two figures but in the first the normal has the same variance as the Poisson and the second it is the same as the binomial.
11) Normal distributions are often applied to count data even though count data can only take on positive integer values. Is this fair is this to do in these two examples? (i.e. how good is the normal approximation)
12) Would the normal be a fair approximation to the Poisson curves for small numbers (the first panel)? How about for the Bionomial for small numbers (earlier panel of figures on the Binomial)?
13) Is the Poisson a good approximation of the Binomial?
14) Is it possible to choose the parameters so that the Poisson and Binomial to both have the same mean and variance? If so what is this parameterization?
The negative binomial has two interesting interpretations that are subtlety different from either the Poisson or Binomial. In the first case, it is the number of trials needed in order to observe a fixed number of occurrences, which is the opposite from the Binomial’s number of occurrences in a fixed trial size and thus where it gets its name. The Negative Binomial also arises as the distribution of number of events that occur in a fixed space or time when the rate is not constant (as in the Poisson) but varies according to a Gamma distribution. Hence the Negative Binomial is also used to describe data that logically seems to come from a Poisson process but has greater variability that is expected from the Poisson (which by definition has a variance equal to its mean). The Geometric distribution arises as a special case of the negative binomial where the number of occurrences is fixed at 1.
x <- 0:20
## negative binomial
## vary size
plot(x,dnbinom(x,1,0.5),type="s",main="vary size")
lines(x,dnbinom(x,2,0.5),type="s",col=2)
lines(x,dnbinom(x,3,0.5),type="s",col=3)
lines(x,dnbinom(x,5,0.5),type="s",col=4)
lines(x,dnbinom(x,10,0.5),type="s",col=5)
legend("topright",legend=c(1,2,3,5,10),col=1:5,lty=1)
## vary probability
plot(x,dnbinom(x,3,0.5),type="s",main="vary probability")
lines(x,dnbinom(x,3,0.3),type="s",col=2)
lines(x,dnbinom(x,3,0.2),type="s",col=3)
lines(x,dnbinom(x,3,0.1),type="s",col=4)
legend("topright",legend=c(0.5,0.3,0.2,0.1),col=1:5,lty=1)
## vary variance , alternate parameterization
mean = 8
var = c(10,20,30)
size = mean^2/(var-mean)
plot(x,dnbinom(x,mu=mean,size=size[1]),type="s",ylim=c(0,0.14),main="vary variance")
lines(x,dnbinom(x,mu=mean,size=size[2]),type="s",col=2)
lines(x,dnbinom(x,mu=mean,size=size[3]),type="s",col=3)
legend('topright',legend=format(c(var,mean),digits=2),col=1:4,lty=1)
lines(x,dpois(x,mean),col=4,type="s")
## NB as generalization of pois with inflated variance
## geometric
plot(x,dgeom(x,0.5),type="s",main="Geometric")
lines(x,dgeom(x,0.15),type="s",col=2)
lines(x,dgeom(x,0.05),type="s",col=3)
lines(x,dnbinom(x,1,0.15),type="s",col=4,lty=2)
## geometric as special case of NB where size = 1
15) In the 'vary size' panel, what are the means of the curves?
16) In the “vary variance” panel, how does the shape of the Negative Binomial compare to the Poisson?